Abstract

We consider an interactive multiview video streaming (IMVS) system where clients select their preferred viewpoint 
in a given navigation window. To provide high quality IMVS, many high quality views should be transmitted to the 
clients. However, this is not always possible due to the limited and heterogeneous capabilities of the clients. In this 
paper, we propose a novel adaptive IMVS solution based on a layered multiview representation where camera views
are organized into layered subsets to match the constraints of the different clients. We formulate an optimization problem
for the joint selection of the view subsets and their encoding rates. Then, we propose an optimal algorithm and a
reduced-complexity greedy algorithm, both based on dynamic programming. Simulation results show that our
algorithms perform well compared to a baseline algorithm, indicating that an effective adaptive IMVS
solution should consider the scene content, the client capabilities and the clients' preferences in navigation.

Keywords: Interactive multiview video, layered representation, navigation window, view synthesis. 


1. Introduction 

In emerging multiview video applications, an array of cameras captures the same 3D scene from different viewpoints
in order to provide the clients with the capability of choosing among different views of the scene. Intermediate
virtual views, not available from the set of captured views, can also be rendered at the decoder by depth-image-based
rendering (DIBR) techniques [1], if texture information and depth information of neighboring views are available.
As a result, interactive multiview video clients have the freedom of selecting a viewpoint from a set of captured and
virtual views that define a navigation window. The quality of the rendered views in the navigation window depends
on the quality of the captured views and on their relative distance, as the distortion of a virtual view tends to increase
with the distance to the views used as references in the view synthesis process. This means that in the ideal case,
all the captured views, encoded at the highest possible rate, would be transmitted to all the clients. However, in practice,
resource constraints prevent the transmission of all the views. In particular, clients may have different access link
bandwidth capabilities, and some of them may not be able to receive all the captured views. In this context, it becomes
important to devise adaptive transmission strategies for interactive multiview video streaming (IMVS) systems that 
adapt to the capabilities of the clients. 

In this work, we consider the problem of jointly determining which views to transmit and at what encoding rate, 
such that the expected rendering quality in the navigation window is maximized under relevant resource constraints. 
In particular, we consider the scenario illustrated in Fig. [1] where a set of views are captured from an array of time- 
synchronized cameras. For each captured view, both a texture and a depth map are available, so that intermediate 
virtual viewpoints can eventually be synthesized. The set of captured and virtual views defines the navigation window
available for client viewpoint request. Clients are clustered according to their bandwidth capabilities; for instance, in
Fig. 1 only one client per cluster is illustrated for three groups with 50kbps, 5Mbps and 10Mbps bandwidth constraints.
Then, the set of captured views is organized in layers or subsets of views to be transmitted to the different groups of
clients in order to maximize the overall navigation quality. With a layered organization of the captured views in the 
navigation window, we aim at offering a progressive increase of the rendering quality, as the quality of the navigation 
improves with the number of layers (subsets of views) that clients are able to receive. In the example of Fig. 1, three
layers or subsets of views are formed as: $L_1 = \{v_1, v_6\}$, $L_2 = \{v_4\}$ and $L_3 = \{v_2, v_3, v_5\}$. Depending on the clients'
bandwidth capabilities, they receive the views in $L_1$, or in $L_1$ and $L_2$, or in $L_1$, $L_2$ and $L_3$. In particular, the client with
the lowest bandwidth capability (the client with a mobile phone) is able to receive only the subset of views $\{v_1, v_6\}$,
and needs to synthesize the rest of the views. On the other hand, the client with the highest bandwidth capability (the 
client with a TV), is able to receive all the views, and therefore reaches the highest navigation quality. 

We formulate an optimization problem to jointly determine the optimal arrangement of views in layers along with 
the coding rate of the views, such that the expected rendering quality is maximized in the navigation window, while 
the rate of each layer is constrained by network and client capabilities. We show that this combinatorial optimization
problem is NP-hard, meaning that it is computationally difficult and that no known algorithm solves it optimally
in polynomial time. We then propose a globally optimal solution based on a dynamic programming
(DP) algorithm. As the computational complexity of this algorithm grows with the number of layers, a greedy and 
lower complexity algorithm is proposed, where the optimal subset of views and their coding rates are computed 
successively for each layer by a DP-based approach. The results show that our greedy algorithm achieves a close-to- 
optimal performance in terms of total expected distortion, and outperforms a distance-based view and rate selection 
strategy used as a baseline algorithm for layer construction. 





Figure 1: Illustration of an IMVS system with 6 camera views and 3 heterogeneous clients. The optimization is done 
by the layered representation creation module considering three layers. 

This paper is organized as follows. First, Section 2 discusses the related work. Then, the main characteristics of
the layered interactive multiview video representation are outlined in Section 3, where our optimization problem
is also formulated. Section 4 describes the optimal and greedy view selection and rate allocation algorithms for our
layered multiview representation. Section 5 presents the experimental results that show the benefits of the proposed
solution, and the conclusions are outlined in Section 6.

2. Related work 

In this section, we review the work related to the design of IMVS systems by focusing on the problem of data 
representation and transmission in constrained resources environments. 
In general, the limited bandwidth problem in IMVS has been approached by proposing some coding/prediction
structure optimization mechanisms for the compression of multiview video data. In |[3l, 01 IS. the authors have 
studied the prediction structures based on redundant P- and DSC-frames (distributed source coding) that facilitate 
a continuous view-switching by trading off the transmission rate and the storage capacity. To save transmission 
bandwidth, different inter-view prediction structures are proposed in 0 to code each multiview
video dataset in different ways, in order to satisfy different rate-distortion requirements. In 0, I and 0, a prediction structure
selection mechanism has been proposed for minimal distortion view switching while trading off transmission rate and 
storage cost in the IMVS system. 

A different coding solution to the limited bandwidth problem has been proposed in where a user dependent
multiview video streaming for multi-users (UMSM) system is presented to reduce the transmission rate due to
redundant transmission of overlapping frames in multi-user systems. In UMSM, the overlapping frames (potentially
requested by two or more users) are encoded together and transmitted by multicast, while the non-overlapping frames
are transmitted to each user by unicast. Differently, the authors in lEl and ifTill tackle the problem of scarce
transmission bandwidth by determining the best set of camera views for encoding and by efficiently distributing the available
bits among texture and depth maps of the selected views, such that the visual distortion of reconstructed views is 
minimized given some rate constraints. 

Although these works propose solutions to the constrained bandwidth problem in IMVS, they do not consider the 
bandwidth heterogeneity of the clients, and rather describe solutions that do not adapt to the different capabilities of 
the clients. 

The adaptive content concept in multiview video has been mostly used in the coding context, where the problem 
of heterogeneous clients has been tackled via scalable multiview video coding. For instance, some extensions of the 
H.264/SVC standard d for traditional 2D video have been proposed in the literature for multiview video ifT^ ifT^ .
Ind, d andid], the authors propose a joint view and rate adaptation solution for heterogeneous clients. Their 
solution is based on a wavelet multiview image codec that produces a scalable bitstream from which different subsets 
can be extracted and decoded at various bitrates in order to satisfy the bandwidth capabilities of different clients.

In addition, multiview video permits the introduction of a new type of adaptive content compared to classical 
video. For instance, instead of transmitting the complete set of views of the multiview video dataset, some views can 
be omitted from the compressed bitstream and eventually reconstructed at the receiver side using DIBR methods. The
more views are transmitted, the higher the reconstruction quality, but also the larger the bitrate. This new type
of adaptive content, which trades off navigation quality and transmission bandwidth, has been studied in 0 and ifl^ .
In these works, the set of captured views are organized in subsets, that we call layers, and they are transmitted to 
clustered heterogeneous clients according to their bandwidth capabilities. 

The work in IB, however, does not optimize the set of views per layer, but rather distributes them based on a
uniform distance criterion. The work in 0 optimizes the selection of views in particular settings, but does not adapt
the coding rate of each camera view. In this paper, we build on these previous works and extend the state-of-the- 
art by proposing solutions for jointly optimizing the selection of the views and their coding rate in a layered data 
representation for IMVS systems, such that the expected rendering quality is maximized in the navigation window, 
while the rate of each layer is constrained by network and client capabilities. To solve this problem, we propose an
optimal algorithm and a greedy algorithm with a reduced complexity, both based on dynamic programming. These
algorithms adapt to the client capabilities and their preferences in navigation, to the camera positions and to the content
of the 3D scene, in order to provide an effective adaptive solution for IMVS systems.


3. Framework and Problem Formulation 

We consider the problem of building a layered multiview video representation in an IMVS system, where the 
clients are heterogeneous in terms of bandwidth capabilities. In this section, we first describe the most relevant 
characteristics of the IMVS system. Then, we formally formulate our optimization problem. 

3.1. Network and IMVS model 

In this work, we define $\mathcal{V} = \{v_1, v_2, \ldots, v_V\}$ as the ordered set of captured views from an array of cameras.
Each camera compresses the recorded view before transmitting it over the network. We assume that there is no 
communication between the cameras, so each camera encodes its images independently of the other cameras, which
is common in numerous novel applications ranging from surveillance to remote sensing. For each captured view
$v_j \in \mathcal{V}$, both a texture image (image containing color information) and a depth map (image containing distance
information to objects in the 3D scene) are captured by the cameras, encoded and transmitted so that clients can 
eventually synthesize new images by DIBR techniques. 

A population of heterogeneous clients requests camera views from the IMVS system, such that they can freely 
navigate within the navigation window between views $v_1$ and $v_V$. Navigation is achieved by rendering views in the
navigation window, possibly with the help of DIBR methods. At the decoder side, each client can reconstruct a view at any
position in the discrete set $\mathcal{U} = \{v_1, v_1 + \delta, \ldots, v_V\}$, with $\delta$ as the minimum distance between consecutive views in the
navigation window. Due to resource constraints in practical systems, it is not possible to transmit all the camera views
to all the clients. Therefore, clients are clustered according to their bandwidth capabilities and the set of views transmitted
to each group of clients is carefully selected, so that their navigation quality is maximized. We propose a layered
multiview representation with an optimal joint allocation of captured views and coding rates in each layer, which is
common for all the clients. The optimization of the layered representation is done centrally by a coordinating engine
that we denote as layered representation creation module (Fig. 1).

3.2. Layered multiview video representation model 

We now give some details on our layered multiview representation. The views $\mathcal{V}$, encoded at rates $\mathcal{R} = \{r_1, r_2, \ldots, r_V\}$,
are organized into layered subsets $\mathcal{L} = \{L_1, \ldots, L_C\}$ to offer a progressive increase of the visual navigation quality
with an increasing number of layers. In particular, the finite set of cameras $\mathcal{V}$ is divided into $C$ layers such that
$L_1 \cup L_2 \cup \ldots \cup L_C \subseteq \mathcal{V}$, with $L_i \cap L_j = \emptyset$, $i \neq j$. The number of layers $C$ corresponds to the number of subsets of
heterogeneous clients grouped according to their bandwidth capabilities. As a requirement, a client cannot decode a
view in $L_c$ without receiving the views in $L_{c-1}$, meaning that $L_1$ and $L_C$ are the most and the least important subsets,
respectively. This means that clients with very low bandwidth capabilities may only receive views in layer $L_1$, and
need to synthesize the missing viewpoints. On the other hand, clients with higher bandwidth capabilities receive more
layers, which leads to a lower rendering distortion as the distance between reference views decreases, hence the view
synthesis is of better quality. In addition, we denote by $\mathcal{L}_1^c = \{L_1, \ldots, L_c\}$ the layers between $L_1$ and $L_c$, and by
$\mathcal{V}_1^c = \{v_1, \ldots, v_j, \ldots, v_V\}$ the ordered subset of views in $\mathcal{V}$ that includes only the views in $\mathcal{L}_1^c$. Here, we assume
that view synthesis with DIBR methods is done by using a right and a left reference view. Therefore, the leftmost and
rightmost views of the navigation window, $v_1$ and $v_V$, need to be transmitted in layer $L_1$.

Formally, the quality of the interactive navigation when the views from the $c$ most important layers are received
and decoded can be defined as:

$$D_c(\mathcal{L}_1^c, \mathcal{R}(\mathcal{V}_1^c)) = \sum_{u \in \mathcal{U}} q_u \, d_u(v_l(u), v_r(u)) \qquad (1)$$

where $v_l(u)$ and $v_r(u)$ are the closest left and right reference views to view $u$ among the views in $\mathcal{V}_1^c$, and $d_u$ is the
distortion of view $u$ when it is synthesized using $v_l(u)$ and $v_r(u)$ as reference views, encoded at rates $r_v$ in $\mathcal{R}$, for $v$ in
$\{v_l(u), v_r(u)\}$. Finally, $q_u$ is the view popularity factor describing the probability that a client selects view $u \in \mathcal{U}$ for
navigation. We assume that $q_u$ depends on the popularity of the views, due to the scene content, but it is independent
of the view previously requested by the client. Note that $D_c \geq D_{c+1}$, since each camera view subset or layer provides
a refinement of the navigation quality experienced by the client.
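
As a rough illustration of the expected distortion in Eq. (1), the following Python sketch evaluates the popularity-weighted distortion for a given set of received reference views. The distortion model `toy_distortion`, the argument names and all numeric values are assumptions made for this example only; they are not the distortion model used in this work.

```python
from bisect import bisect_left, bisect_right


def toy_distortion(u, vl, vr, rl, rr):
    """Hypothetical distortion model (an assumption, not the paper's model).

    Averages a coding term 1/rate over both references and adds a term that
    grows with the distance from u to its nearest reference view.
    """
    coding = 0.5 * (1.0 / rl + 1.0 / rr)
    return coding + 0.1 * min(u - vl, vr - u)


def expected_distortion(nav_views, popularity, rates):
    """Popularity-weighted distortion over the navigation window.

    nav_views  : sorted view positions in the navigation window
    popularity : probability of each view being requested (same order)
    rates      : dict {reference view position: coding rate} of received views;
                 must contain the leftmost and rightmost navigation views
    """
    refs = sorted(rates)
    total = 0.0
    for u, q in zip(nav_views, popularity):
        vl = refs[bisect_right(refs, u) - 1]  # closest reference <= u
        vr = refs[bisect_left(refs, u)]       # closest reference >= u
        total += q * toy_distortion(u, vl, vr, rates[vl], rates[vr])
    return total


nav = [1, 2, 3, 4, 5, 6]
pop = [1 / 6] * 6
d_one_layer = expected_distortion(nav, pop, {1: 2.0, 6: 2.0})
d_two_layers = expected_distortion(nav, pop, {1: 2.0, 4: 2.0, 6: 2.0})
print(d_one_layer, d_two_layers)
```

Under this toy model, adding a reference view in a second layer lowers the expected distortion, which matches the refinement property of the layered representation.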

3.3. Problem Formulation 

We now formulate the optimization problem for the allocation of coded views in layers and their rate allocation 
in order to maximize the expected navigation quality for all IMVS clients. More specifically, the problem is to find 
the optimal subset of captured views in $\mathcal{V}$ that should be allocated to each of the $C$ layers, $\mathcal{L}^* = \{L_1^*, \ldots, L_C^*\}$, and the
coding rate of each selected view, $\mathcal{R}^* = \{r_1^*, \ldots, r_V^*\}$, such that the expected distortion of the navigation is minimized
for all the clients, while the bandwidth constraint per layer, $\mathcal{B} = \{B_1, \ldots, B_C\}$, is satisfied. This bandwidth constraint
is associated with the bandwidth capabilities of each client cluster. The optimization of the number of layers and of
the rate constraints of the layers due to clients' bandwidth capabilities is out of the scope of this paper. Formally, the
optimization problem can be written as:

$$\min_{|\mathcal{L}| = C,\; |\mathcal{R}| = V} \; \sum_{c=1}^{C} p(c)\, D_c(\mathcal{L}_1^c, \mathcal{R}(\mathcal{V}_1^c)) \qquad (2)$$

such that

$$\sum_{v \in \mathcal{V}_1^c} r_v \leq B_c, \quad \forall c \in \{1, \ldots, C\}$$

where $p(c)$ stands for the proportion of clients that are able to receive up to layer $L_c$, namely clients with rate
capability larger than $B_c$ but lower than $B_{c+1}$. The rate of each encoded view $v$ in $\mathcal{V}_1^c$ is denoted by $r_v$, and the
distortion $D_c$ is given in Eq. (1). We finally assume that the depth maps are all encoded at the same high quality, as
accurate depth information is important for view synthesis. In practice, the coding rate of depth maps is much smaller
than the rate of the texture information, even when compressed at high quality. In the above problem formulation, the
rate of encoded views can be formally written as $r_v = r_v^t + r_v^d$, with $r_v^t$ as the rate of the texture information and $r_v^d$
as the rate of the depth information of view $v$. For the sake of clarity, and without loss of generality, we assume in
the following that $r_v = r_v^t$, due to the low rate contribution of the compressed depth maps compared with the texture
information.
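
To make the structure of the problem in (2) concrete, the sketch below checks the cumulative rate constraint of each layer and evaluates the weighted objective for one candidate layer arrangement. The helper names (`cumulative_views`, `is_feasible`, `objective`), the example layers and the placeholder distortion function are hypothetical and chosen only for illustration.

```python
def cumulative_views(layers, c):
    """Sorted views contained in layers L_1..L_c (c is 1-based)."""
    return sorted(v for layer in layers[:c] for v in layer)


def is_feasible(layers, rates, budgets):
    """Check that the views received up to layer c fit within budget B_c."""
    return all(
        sum(rates[v] for v in cumulative_views(layers, c)) <= budgets[c - 1]
        for c in range(1, len(layers) + 1)
    )


def objective(layers, rates, proportions, distortion_of):
    """Weighted sum over layers of p(c) * D_c.

    distortion_of(views, rates) is a caller-supplied function returning the
    navigation distortion D_c for the given received views (an assumption:
    any distortion model can be plugged in here).
    """
    return sum(
        p * distortion_of(cumulative_views(layers, c), rates)
        for c, p in enumerate(proportions, start=1)
    )


# Illustrative instance with 6 views, 3 layers and unit rates.
layers = [[1, 6], [3, 4], [2, 5]]
rates = {v: 1.0 for v in range(1, 7)}
budgets = [2, 4, 6]
proportions = [0.5, 0.3, 0.2]

# Placeholder distortion: decreases with the number of received views.
placeholder = lambda views, r: 1.0 / len(views)

print(is_feasible(layers, rates, budgets))
print(objective(layers, rates, proportions, placeholder))
```

An exhaustive search over all layer arrangements and rate values would call these two functions for every candidate, which is what makes the problem combinatorial.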

3.4. NP-Hardness Proof 

We now prove that the optimization problem in (2) is NP-hard, by reduction from a well-known NP-complete
problem, the Knapsack problem. The Knapsack problem is a combinatorial problem that can be characterized as
follows:

Settings - Non-negative weights $w_1, w_2, \ldots, w_V$, profits $c_1, c_2, \ldots, c_V$, and capacity $W$.

Problem - Given a set of items, each with a weight and a profit, find a subset of these items such that the
corresponding profit is as large as possible and the total weight is less than $W$.

We now consider a simplified instance of our problem in (2) with only one layer and a unique rate value
for each captured view. Intuitively, if the problem is NP-hard for this simplified case, it will also be NP-hard for
the full optimization problem. We reduce the Knapsack problem to this simplified problem. First, we map each
weight $w_v$ to a view rate $r_v$. Then, when a view $v_j$ is considered as a reference view for the corresponding layer, the
profit is quantified by the distortion reduction that it brings, denoted here as $\theta(v_j)$, that is, the decrease in the layer
distortion when $v_j$ is added to the current set of reference views $\mathcal{V}_1^1$. However, the profit $\theta(v_j)$ of each view is not
independent of the content of the layers, as it is for each object in the Knapsack problem. The profit depends on the views that have been
already selected as reference views in the layer, meaning $\mathcal{V}_1^1$. This increases the complexity of the view selection
and rate allocation problem compared to the classic Knapsack problem. Therefore, if the problem is NP-hard when
profits $\theta(v_j)$ are independent of the layer content, it will be NP-hard for our simplified problem. Then, assuming an
independent profit for each view, our simplified problem can be rewritten as:

Settings - Rates of the possible reference views $r_1, r_2, \ldots, r_V$, independent profits $\theta(v_1), \theta(v_2), \ldots, \theta(v_V)$ and
bandwidth capacity $B_c$.

Problem - Given a set of views, each with a rate and a profit, find the subset of views such that the distortion
reduction is as large as possible and the total rate is less than $B_c$.

This reduced problem is equivalent to the Knapsack problem. Hence, this proves that our original optimization
problem is at least as hard as the Knapsack problem. Therefore, our problem in (2) is NP-hard.
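
For reference, the Knapsack problem used in this reduction can be solved in pseudo-polynomial time by a textbook dynamic program when weights are integers. The sketch below is the standard 0/1 Knapsack recursion, not an algorithm from this work; it treats the capacity as an inclusive bound, which is an assumption of this example. In the mapping above, weights play the role of view rates, profits the role of independent distortion reductions, and the capacity the role of the layer bandwidth.

```python
def knapsack(weights, profits, capacity):
    """Maximum total profit of a subset with total integer weight <= capacity."""
    best = [0] * (capacity + 1)  # best[b] = best profit achievable with budget b
    for w, c in zip(weights, profits):
        # Iterate budgets downwards so that each item is used at most once.
        for b in range(capacity, w - 1, -1):
            best[b] = max(best[b], best[b - w] + c)
    return best[capacity]


# Three candidate views with rates 2, 3, 4 and profits 3, 4, 5; budget 5.
print(knapsack([2, 3, 4], [3, 4, 5], 5))
```

The running time is proportional to the number of items times the capacity, which is not polynomial in the size of the input encoding; this is consistent with the problem being NP-hard.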

4. Proposed Optimization Algorithms 

To tackle the problem in (2), we first propose an algorithm that solves the optimization optimally. Second, we
present a reduced complexity algorithm that finds a locally optimal solution working on a layer by layer basis, with 
an average quality performance close to the optimal algorithm. 
4.1. Optimal Algorithm


To obtain an optimal solution to the problem in (2), we propose a dynamic programming (DP) algorithm that
solves problems by breaking them down into subproblems and combining their solutions. The subproblems are solved
only once, and their solutions are stored in a DP table to be reused whenever the same subproblem recurs
[20]. To develop a DP algorithm for the problem defined in (2), we first need to identify the structure of the problem
and how it can be decomposed. We start with the following observations:


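1. Decomposition in the view domain - We first observe that the rendering quality $D_c$ in Eq. (1) can be computed
by parts. In particular, we can write:

$$D_c(\mathcal{L}_1^c, \mathcal{R}(\mathcal{V}_1^c)) = \sum_{j=1}^{|\mathcal{V}_1^c| - 1} \Delta_c\big(\mathcal{V}_1^c(j), \mathcal{V}_1^c(j+1), \mathcal{R}(\mathcal{V}_1^c(j)), \mathcal{R}(\mathcal{V}_1^c(j+1))\big) \qquad (3)$$

where

$$\Delta_c(v_x, v_y, r_x, r_y) = \sum_{\substack{u = v_x \\ r_i = \mathcal{R}(v_i) \\ \forall v_i \in \{v_l(u), v_r(u)\}}}^{v_y} q_u \, d_u(v_l(u), v_r(u)) \qquad (4)$$

is the distortion (weighted by the view popularity $q_u$) of rendering views between views $v_x$ and $v_y$, compressed at rates $r_x$ and $r_y$ respectively, using camera
views $v_l(u)$ and $v_r(u)$ as the closest left and right reference views of view $u$ in $\mathcal{V}_1^c$. In (3), $|\mathcal{V}_1^c|$ stands for the size of the
set $\mathcal{V}_1^c$, meaning the number of views in $\mathcal{L}_1^c$. Note that, for the sake of clarity, we drop the parameters $\mathcal{L}_1^c$ and $\mathcal{R}(\mathcal{V}_1^c)$
in the distortion $\Delta_c$; it is however clear that the distortion is only computed for a given layer representation and coding
rates.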

Using the above observation, for a subset of captured views $\{v_l, v_i, v_r\}$ used as reference views, with $v_l < v_i < v_r$ and
encoded at rates $\{r_l, r_i, r_r\}$, respectively, we can express the distortion $\Delta_c(v_l, v_r, r_l, r_r)$ as:

$$\Delta_c(v_l, v_r, r_l, r_r) = \Delta_c(v_l, v_i, r_l, r_i) + \Delta_c(v_i, v_r, r_i, r_r) \qquad (5)$$

This decomposition is possible as the distortion in a set of views only depends on the closest right and left reference 
views for each synthetic view in our system. 

2. Decomposition in the layer domain - Given a multiview layered representation of $C$ layers, the expected
distortion between reference views $v_l$ and $v_r$ for the client groups subscribed to data in layers $c$ to $C$, denoted here as
$\Phi_c^C(v_l, v_r, r_l, r_r)$, can be expressed as:

$$\Phi_c^C(v_l, v_r, r_l, r_r) = \sum_{i=c}^{C} p(i)\, \Delta_i(v_l, v_r, r_l, r_r) = p(c)\, \Delta_c(v_l, v_r, r_l, r_r) + \Phi_{c+1}^C(v_l, v_r, r_l, r_r) \qquad (6)$$

As clients receiving higher layers also need to receive all the previous layers for optimal quality improvement, the
reference views in layer $c$ become available for any layer $i > c$. This means that the distortion difference for clients
in layers $c$ and $c + 1$ simply depends on the improvement provided by views in $L_{c+1}$. In other words, the expected
distortion can be computed iteratively. Finally, note that $\Phi_1^C(v_1, v_V, r_1, r_V)$ is actually the objective function of our
optimization problem in Eq. (2), where $\Delta_c(v_1, v_V, r_1, r_V) = D_c(\mathcal{L}_1^c, \mathcal{R}(\mathcal{V}_1^c))$.
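
The layer-domain recursion can be illustrated with a few lines of Python. The sketch below assumes that the per-layer segment distortions are already available as a list (`delta`, hypothetical values) and accumulates the expected distortion from the last layer back to layer `c`, following the two-term form of the recursion.

```python
def expected_distortion_from_layer(c, p, delta):
    """Expected distortion for clients subscribed to layers c..C (c is 1-based).

    p[i - 1]     : proportion of clients receiving up to layer i
    delta[i - 1] : segment distortion when layers 1..i are received
    Implements phi_c = p(c) * delta_c + phi_{c+1}, with phi_{C+1} = 0.
    """
    num_layers = len(p)
    phi = 0.0
    for layer in range(num_layers, c - 1, -1):
        phi = p[layer - 1] * delta[layer - 1] + phi
    return phi


p = [0.5, 0.3, 0.2]        # hypothetical client proportions
delta = [4.0, 2.0, 1.0]    # hypothetical distortions, decreasing with more layers
print(expected_distortion_from_layer(1, p, delta))
print(expected_distortion_from_layer(2, p, delta))
```

Because each step only adds the contribution of one layer, the value for layer `c` can reuse the value already computed for layer `c + 1`, which is the property exploited by the DP algorithm.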
Now, let $\Phi_c^*(v_l, v_r, r_l, r_r, \mathcal{B}_c^C)$ be the minimum expected distortion $\Phi_c^C(v_l, v_r, r_l, r_r)$ when the rate budget for each
layer and for the subset of camera views between $v_l$ and $v_r$ is $\mathcal{B}_c^C = [B_c, B_{c+1}, \ldots, B_C] = [B_c, \mathcal{B}_{c+1}^C]$. Based on the
above observations, (5) and (6), this minimum distortion can be defined in a recursive way as follows:

$$\Phi_c^*(v_l, v_r, r_l, r_r, \mathcal{B}_c^C) = \min_{\substack{v_l < v_i < v_r \\ r_i,\; \beta_{c+1}^C}} \Big\{ p(c)\, \Delta_c(v_l, v_i, r_l, r_i) + \Phi_{c+1}^*(v_l, v_i, r_l, r_i, \beta_{c+1}^C) + \Phi_c^*\big(v_i, v_r, r_i, r_r, \mathcal{B}_c^C - [r_i, \beta_{c+1}^C]\big) \Big\} \qquad (7)$$

The first term in (7) corresponds to the layer distortion between views $v_l$ and $v_i$, as defined in (4). The second
term defines the minimum distortion between views $v_l$ and $v_i$ from layer $c + 1$ to layer $C$, when the rate constraint
assigned to each layer is $\beta_{c+1}^C = [b_{c+1}, \ldots, b_C] = [b_{c+1}, \beta_{c+2}^C]$, for $\beta_{c+1}^C \leq \mathcal{B}_{c+1}^C$. Finally, the third term is associated
with the minimum expected distortion for clients receiving the views from layer $c$ to $C$, between views $v_i$ and $v_r$, when
the rate constraint is $\mathcal{B}_c^C - [r_i, \beta_{c+1}^C]$. This recursion relation in (7) directly carries the structure of our DP algorithm that
optimally solves the problem in Eq. (2). The three terms in (7) are illustrated in Fig. 2 for views from layer $L_c$ to layer
$L_{c+1}$.

Figure 2: Illustration of the optimal algorithm with the three terms from (7) for views in layer $L_c$ to layer $L_{c+1}$ that
compose the recursive evaluation of $\Phi_c^*(v_l, v_r, r_l, r_r, \mathcal{B}_c^C)$.


To solve the problem in (7) using a DP algorithm, we follow a bottom-up approach, where the “smaller” subproblems
are solved first and their solutions are used to solve “larger” subproblems. In particular, the following steps are
followed:

1. Create a DP table of size V^{B„,axP^^, where B^ax is the maximum bandwidth constraint in Sf, and therefore 
the maximum possible rate value used to encode the camera views. 

2. Start from the last layer $L_C$, with $v_l = v_{V-1}$ and $v_r = v_V$. Solve for every possible encoding rate of $v_{V-1}$ and $v_V$
and rate budget $B_C$. Store each solution in the $(v_{V-1}, v_V, \mathcal{R}(v_{V-1}), \mathcal{R}(v_V), B_C)$ entry.

3. By progressively moving $v_l$ towards $v_1$ in the same layer, and then towards the first layer, following a bottom-up
approach, every time $\Phi_c^*$ is called with the argument $(v_l, v_r, r_l, r_r, \mathcal{B}_c^C)$, the DP algorithm retrieves the value
stored in the corresponding entry of the DP table.

4. Stop when the DP table is completely filled. The optimal solution of the problem is found by comparing the
optimal solutions in the $(v_1, v_V, r_1, r_V, \mathcal{B}_1^C)$ entries.

The complexity of the DP algorithm that implements (|2l) can be deduced from the size of the DP table that 
contains the solutions to the subproblems Of, that in this case is 0{V^{B„^x)^*~)- In addition, in order to compute 
each entry in the DP table, we use (|7]i where we need at most {V B,„ax)‘^^"'“ comparisons, corresponding to all possible 
Vi, ri for all possible rate constraints for the following layers CB„tax- Hence the total complexity of the algorithm is 

0^(v^{Bmax)^^^^ ° j, which is exponential with the number of layers C and the maximum number of rate values 

used to encode the camera views B,„ax- 
4.2. Greedy Algorithm

The computational time to solve the optimization problem with the above DP algorithm is exponential and rapidly
grows with the number of available layers and coding rate values used to encode each of the captured views. Therefore,
we propose a greedy approximate solution where the optimization problem defined in (2) is solved successively for
each layer, starting from the first layer. When solving the optimization problem for each layer successively, the
optimal reference views are selected from the full set of captured views when optimizing the first layer, while for the
following layers, the solution is restricted to the views that have not been selected as reference views in the previous
layers. This restriction can make the overall solution suboptimal. However, the intuition behind this greedy algorithm
is that, in our system, the lowest layers are necessary to
most of the clients, for which our greedy algorithm tends to be close to optimal. Therefore, it is expected that our
greedy algorithm leads to an effective solution in terms of overall expected distortion. Formally, the greedy algorithm
considers the following optimization problem for each layer $L_c$:

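$$\min_{L_c,\, \mathcal{R}_c} \; p(c)\, D_c(\mathcal{L}_1^c, \mathcal{R}(\mathcal{V}_1^c)) \qquad (8)$$

such that

$$\sum_{v \in \mathcal{V}_1^c} r_v \leq B_c$$

where $\mathcal{R}_c$ stands for the set of coding rates of the views selected as reference views in layer $L_c$.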

To obtain an approximate solution, meaning the optimal solution in each particular layer given the set of available
reference views, we propose a dynamic programming (DP) algorithm that is inspired by the algorithm in Section 4.1.
Let $\Psi_c(v_l, v_V, r_l, r_V, B_c)$ be the minimum expected distortion between reference views $v_l$ and $v_V$, compressed at rates $r_l$
and $r_V$, respectively, for clients subscribed to layer $L_c$. The remaining rate budget $B_c$ is available for selecting new
views in layer $L_c$ between the given reference views $v_l$ and $v_V$. This optimal solution is again a recursive function that
finds the optimal $\{v_i, r_i\}$, with $v_l < v_i < v_V$, minimizing $\Delta_c(v_l, v_i, r_l, r_i)$ and the optimal solution $\Psi_c$ in the remaining set
of views between $v_i$ and $v_V$. This can be formally written as:

$$\Psi_c(v_l, v_V, r_l, r_V, B_c) = \min_{\substack{v_l < v_i < v_V \\ 0 < r_i \leq B_c}} \big\{ \Delta_c(v_l, v_i, r_l, r_i) + \Psi_c(v_i, v_V, r_i, r_V, B_c - r_i) \big\} \qquad (9)$$

A DP algorithm implements the recursive formulation in (9) to determine the optimal allocation of views in layer
$L_c$, given the allocation in the lower layers. The algorithm runs for each layer successively, starting from the first layer
$L_1$. Similarly to the optimal algorithm in Section 4.1, a bottom-up approach is followed for each layer. In particular,
the following steps are followed:

1. Given layer Lc, create a DP table of size V(Bc)^, where Be is the rate constraint of the layer and therefore the 
maximum possible rate value to encode the selected reference views. Note that Vr = vy for every layer; therefore 
the size of the table regarding the number of views is only V, and not as for the optimal algorithm. 

2. Start with $v_l = v_{V-1}$ and solve (9) for every possible encoding rate of $v_{V-1}$ and $v_V$ and rate budget $B_c$. Store the
solution in the $(v_{V-1}, v_V, \mathcal{R}(v_{V-1}), \mathcal{R}(v_V), B_c)$ entry.

3. By progressively moving $v_l$ towards $v_1$ in the same layer, following a bottom-up approach, every time $\Psi_c$ is
called with the argument $(v_l, v_V, r_l, r_V, B_c)$, the DP algorithm retrieves the value stored in the corresponding
entry of the DP table.

4. Stop when the DP table is completely filled or when $v_l = v_1$. The optimal solution $L_c^*$ of the problem is
found by comparing the optimal solutions in the $(v_1, v_V, r_1, r_V, B_c)$ entries.

5. If $\mathcal{V}_1^c \subset \mathcal{V}$ and $L_{c+1}$ is available, move to the first step of the algorithm with $L_c = L_{c+1}$.
Figure 3: Illustration of our greedy algorithm with the two terms from (9) for views $v_l$, $v_i$ and $v_V$ in layer $L_c$.


The different terms in (9) are illustrated in Fig. 3 for views $v_l$, $v_i$ and $v_V$ in a particular layer $L_c$ that recursively
compute $\Psi_c(v_l, v_V, r_l, r_V, B_c)$.
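
The per-layer recursion in (9) can be sketched as a memoized function. The code below is a simplified illustration for a single layer, not the implementation used in this work: the distortion model `d`, the half-open segment convention in `delta`, the explicit "add no further view" option used as the base case, and all names and numeric values are assumptions made for the example. Rates are restricted to a small discrete set so that subproblems can be cached.

```python
from functools import lru_cache


def greedy_layer(nav, q, candidates, rate_values,
                 v_first, r_first, v_last, r_last, budget):
    """Return (min distortion, {view: rate}) for the views added in one layer.

    v_first and v_last are the fixed leftmost and rightmost reference views,
    already encoded at rates r_first and r_last; budget is the rate left for
    additional reference views chosen from candidates.
    """
    pop = dict(zip(nav, q))

    def d(u, x, y, rx, ry):
        # Hypothetical distortion model (not the paper's): coding term
        # interpolated between both references plus a distance penalty.
        t = (u - x) / (y - x)
        return (1 - t) / rx + t / ry + 0.1 * min(u - x, y - u)

    def delta(x, y, rx, ry):
        # Segment distortion over x <= u < y (half-open, so a shared
        # reference view is counted only once).
        return sum(pop[u] * d(u, x, y, rx, ry) for u in nav if x <= u < y)

    @lru_cache(maxsize=None)
    def psi(v_l, r_l, b):
        # Base option: add no further view between v_l and v_last.
        best = delta(v_l, v_last, r_l, r_last) + pop[v_last] / r_last
        choice = ()
        # Recursive option: choose the next reference view v_i and rate r_i.
        for v_i in candidates:
            if not v_l < v_i < v_last:
                continue
            for r_i in rate_values:
                if r_i > b:
                    continue
                tail, tail_choice = psi(v_i, r_i, b - r_i)
                cost = delta(v_l, v_i, r_l, r_i) + tail
                if cost < best:
                    best, choice = cost, ((v_i, r_i),) + tail_choice
        return best, choice

    cost, choice = psi(v_first, r_first, budget)
    return cost, dict(choice)


nav = [1, 2, 3, 4, 5, 6]
q = [1 / 6] * 6
print(greedy_layer(nav, q, (2, 3, 4, 5), (1, 2), 1, 2, 6, 2, budget=0))
print(greedy_layer(nav, q, (2, 3, 4, 5), (1, 2), 1, 2, 6, 2, budget=2))
```

To mimic the layer-by-layer procedure, the function would be called once per layer with the candidate set reduced to the views not selected in earlier layers; handling reference views already placed between the two end views is left out of this sketch.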

Finally, following the same analysis in Section ItTI is the complexity of the DP algorithm 

in as the complexity in solving each of the layers is and the algorithm should run C times, 

one time for each layer. By solving every layer successively in the greedy algorithm, we remove the
exponential dependency of the complexity on the number of layers, and hence substantially reduce the
overall computational complexity compared with the optimal algorithm.


5. Performance Assessment 

This section presents the test conditions and performance results obtained in different scenarios when the search 
of the optimal subset of coded views per layer and rate allocation per view is performed with the algorithms proposed 
in this paper. We study the optimal allocation in different settings and compare it to the solution of a baseline camera 
distance-based solution. 


5.1. General Test Conditions

We consider four different datasets for evaluating the performance of our optimization algorithms. We first study
the performance on two multiview video datasets, Ballet (1024 x 768, 15Hz) [21] and UndoDancer (1920 x 1080,
25Hz) [22]. Though the main target of this work is video delivery, we also consider multiview image datasets,
Statue (2622 x 1718) and Bikes (2676 x 1752), due to the relatively high quality of their depth maps compared
with the ones available in multiview video sequences. Multiview image experiments make it possible to appreciate the benefits
of our solution in allocating resources based on scene content properties. The 3D-HEVC reference software HTM 6.2
[24] has been used to encode jointly texture and depth maps in each dataset. The views are encoded independently
and temporal prediction is used for each view in the video sequences. The depth maps are encoded at high quality
(we set a quantizer scale factor of QP=25 for the depth maps), while a set of different rate values p is considered for
encoding the texture information. For each sequence, the following conditions have been considered:


• Statue - $V = 7$ captured views and $U = 10$ equally spaced rendered views. In this dataset, the cameras are horizontally arranged with a fixed distance between neighboring cameras of 5.33 mm. We have chosen the ten available views to have a separation of at least 26.65 mm between pairs of views, such that $\mathcal{U} = \{50, 55, 60, 65, 70, 75, 80, 85, 90, 95\}$ and $\mathcal{V} = \{50, 55, 65, 70, 80, 85, 95\}$, in terms of view indexes in the dataset.

• Bikes - $V = 7$ and $U = 7$ captured and rendered views, respectively. In this dataset, the cameras are horizontally arranged with a spacing of 5 mm. As for the Statue dataset, to increase the distance between available views, we have chosen the available views by fixing the minimum distance between views to 25 mm. In detail, the seven views correspond to the views $\mathcal{V} = \mathcal{U} = \{10, 20, 25, 30, 35, 40, 50\}$, in terms of dataset indexes.

• Ballet - $V = 7$ captured views and $U = 8$ rendered views. The views follow a circular arrangement and correspond to $\mathcal{V} = \{0, 1, 2, 4, 5, 6, 7\}$ and $\mathcal{U} = \{0, 1, 2, 3, 4, 5, 6, 7\}$, in terms of the view indexes in the dataset.

• UndoDancer - $V = 5$ captured views and $U = 9$ equally spaced rendered views. The cameras for this sequence are horizontally arranged with a fixed distance of 20 cm between neighboring views. They correspond to the captured views $\mathcal{V} = \{1, 2, 3, 5, 9\}$ and the nine rendered views $\mathcal{U} = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$, in terms of dataset indexes.

The distortion of the synthesized view $u$ at the decoder depends on the quality of the reference views used for
synthesis, namely $v_l(u)$ and $v_r(u)$, and on their distance to the synthesized view. For the simulations, we use a distortion
model proposed in our previous work [2], which considers these two factors in estimating the distortion of the synthetic
view $d_u$ as:


$$d_u(v_l, v_r) = (1 - \alpha)\left[d^t_{v_1}(v_l, v_r) + d^d_{v_1}(v_l, v_r)\right] + (1 - \gamma)\,\alpha\left[d^t_{v_2}(v_l, v_r) + d^d_{v_2}(v_l, v_r)\right] + \gamma\,\alpha\,I \qquad (10)$$

where $d^t_{v_i}$ and $d^d_{v_i}$, for $i \in \{1, 2\}$, denote the average distortion per pixel for the texture and the depth map, respectively, of the first and second views that are used as references for view synthesis, with $v_i \in \{v_l(u), v_r(u)\}$. Note that, for the sake of clarity, we have dropped the parameter $u$ from $v_l(u)$ and $v_r(u)$ when it is clear that they refer to the left and right reference views for view $u$ in (10). The parameters $\alpha$ and $\gamma$ are, respectively, the proportion of disoccluded pixels in the projection of the first reference view and in the projections of both reference views in the DIBR view synthesis. Their values depend only on the scene geometry and they are obtained from the depth maps of the reference views. Finally, the average distortion per pixel in the inpainted areas is denoted by $I$, which is assumed to take a constant value that only depends on the scene content.
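
As a reading aid, the structure of (10) can be written as a small function. This is a minimal sketch, assuming that the three weights are $(1-\alpha)$ for pixels taken from the first reference, $(1-\gamma)\alpha$ for pixels taken from the second reference, and $\gamma\alpha$ for inpainted pixels; the function and argument names are hypothetical, and the per-view distortions are passed in as plain numbers rather than computed from coded data.

```python
def synthesized_view_distortion(d_t1, d_d1, d_t2, d_d2, alpha, gamma, inpaint):
    """Distortion of a synthesized view following the structure of (10).

    d_t1, d_d1: texture and depth distortion of the first reference view.
    d_t2, d_d2: texture and depth distortion of the second reference view.
    alpha, gamma: disocclusion proportions, both in [0, 1].
    inpaint: constant average distortion per pixel of inpainted areas (I).
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("alpha and gamma are proportions in [0, 1]")
    from_first = (1.0 - alpha) * (d_t1 + d_d1)            # visible in first reference
    from_second = (1.0 - gamma) * alpha * (d_t2 + d_d2)   # filled from second reference
    inpainted = gamma * alpha * inpaint                   # hidden in both references
    return from_first + from_second + inpainted
```

The three weights sum to one, so with no disocclusion ($\alpha = 0$) the result reduces to the distortion inherited from the first reference, and with full disocclusion in both references ($\alpha = \gamma = 1$) it reduces to $I$.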

In the rest of this section, we carry out simulations for different system settings to evaluate the performance of our greedy and optimal algorithms presented in Sections 4.1 and 4.2. We compare their performance with that of a baseline algorithm, which selects a subset of coded views per layer such that the average distance between reference and synthetic views is minimized in each layer.

5.2. Greedy vs. Optimal algorithm 

In this section, we compare the performance of the optimal and greedy algorithms proposed in Sections 4.1 and 4.2. Due to the exponential complexity of our optimal algorithm, a small discrete set of available rates is used to encode the texture information, and only two layers are considered in the layered multiview representation, which means that the clients are clustered in only two groups depending on their bandwidth capabilities.

We consider two different distributions for the proportion of clients that subscribe to each layer. In particular, we set $p = [0.5\ 0.5]$, when the first half of the clients can only get $L_1$ and the second half get both $L_1$ and $L_2$, and we set $p = [0.1\ 0.9]$, when most of the clients have high bandwidth capabilities and only 10% of them can only get the first layer, $L_1$. We also assume that all the views in $\mathcal{U}$ have the same probability of being requested, which results in a uniform view probability distribution $q$.

The results are presented in Table 1, where the set of views per layer $X^*$ and the expected distortion $D$, defined as $\sum_c p(c) D_c$ with $D_c$ as in (1), are shown for each considered dataset. The rate constraint per layer $B_c$ and the set of available rates for encoding the texture information of each dataset are also given in Table 1. The views selected by each algorithm in each layer are given in terms of the rate, $L_c = [r_1, \ldots, r_v, \ldots, r_V]$, where $r_v = 0$ means that the view is not transmitted in that particular layer and $r_v > 0$ means that the view is encoded at rate $r_v$ in the corresponding layer. The indexes of the views correspond to the arrangement of the views in the set of captured views $\mathcal{V}$.
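
The rate-vector notation and the weighted sum behind $D$ can be illustrated with a few lines of code. This is a minimal sketch with hypothetical helper names; the per-layer distortion values used in any call are placeholders to be supplied by the reader, not results from the paper.

```python
def transmitted_views(layer_rates):
    """Indices of the views carried by a layer (entries with r_v > 0)."""
    return [v for v, r in enumerate(layer_rates) if r > 0]


def layer_rate(layer_rates):
    """Total rate spent in a layer, to be compared with its constraint B_c."""
    return sum(layer_rates)


def expected_distortion(p, per_layer_distortion):
    """Weighted sum of p(c) * D_c over the layers (client clusters)."""
    if len(p) != len(per_layer_distortion):
        raise ValueError("one distortion value is needed per layer")
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("the client distribution p must sum to one")
    return sum(pc * dc for pc, dc in zip(p, per_layer_distortion))
```

For instance, the first-layer vector listed for Statue in Table 1, `[2, 0, 2, 0, 2, 0, 2]`, carries the views at positions 0, 2, 4 and 6 and uses the full 8 Mb of the layer budget.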

It can be seen from the results in Table 1 that the same set of views per layer $X^*$ is chosen by both the greedy and the optimal algorithms when a uniform client distribution $p$ is assumed. The same results have been obtained for values of $p(1)$ higher than 0.5, but they are not presented here due to space restrictions. When $p(2)$ increases, meaning that the second layer $L_2$ is transmitted to a larger group of clients, the greedy algorithm shows its sub-optimality. For instance, when $p(2) = 0.9$, the optimal solution is not obtained by the greedy algorithm for the Bikes dataset; instead, the same solution $X^*$ as for $p(1) = p(2) = 0.5$ is computed. This sub-optimality is due to the fact that, in our greedy algorithm, the problem is solved successively for each layer, starting from layer $L_1$. As a result, the solution $X^*$ computed by the greedy algorithm does not depend on the probability distribution $p$ of clients requesting each layer: for each dataset it is the same for any distribution $p$, which only affects the expected distortion $D$. This successive approach also means that the first layer is prioritized, so that layer $L_1$ always has an optimal set of views independently of the other layers. This explains the good performance of the greedy algorithm when the first layer has a high probability of being transmitted alone, i.e., for high values of $p(1)$. Nevertheless, even when the second layer $L_2$ is transmitted to a larger group of clients, $p(2) = 0.9$, the greedy algorithm performs well, reaching the optimal solution for three of the four datasets considered in our experiments. This good performance of the greedy algorithm can be explained by the fact that the first layer is always received by all

Table 1: Comparison of the optimal and greedy algorithms in terms of view selection and rate allocation $X^*$ and average distortion $D$.

| Sequence & settings | Client distribution $p$ | Optimal $X^*$ | Optimal $D$ (dB) | Greedy $X^*$ | Greedy $D$ (dB) |
|---|---|---|---|---|---|
| Statue, $B_c$ = 8 Mb, rates {0, 2, 4} Mb | [0.5 0.5] | $L_1$ = [2 0 2 0 2 0 2]; $L_2$ = [0 2 0 2 0 4 0] | 29.56 | $L_1$ = [2 0 2 0 2 0 2]; $L_2$ = [0 2 0 2 0 4 0] | 29.56 |
| | [0.1 0.9] | $L_1$ = [2 0 2 0 2 0 2]; $L_2$ = [0 2 0 2 0 4 0] | 30.43 | $L_1$ = [2 0 2 0 2 0 2]; $L_2$ = [0 2 0 2 0 4 0] | 30.43 |
| Bikes, $B_c$ = 8 Mb, rates {0, 1.5, 2} Mb | [0.5 0.5] | $L_1$ = [1.5 1.5 0 2 0 1.5 1.5]; $L_2$ = [0 0 2 0 2 0 0] | 29.00 | $L_1$ = [1.5 1.5 0 2 0 1.5 1.5]; $L_2$ = [0 0 2 0 2 0 0] | 29.00 |
| | [0.1 0.9] | $L_1$ = [2 0 2 0 2 0 2]; $L_2$ = [0 2 0 2 0 2 0] | 31.19 | $L_1$ = [1.5 1.5 0 2 0 1.5 1.5]; $L_2$ = [0 0 2 0 2 0 0] | 29.48 |
| Ballet, $B_c$ = 1 Mb, rates {0, 0.25, 0.3} Mb | [0.5 0.5] | $L_1$ = [0.3 0 0 0.3 0 0 0.3]; $L_2$ = [0 0.3 0.3 0 0.3 0 0] | 37.33 | $L_1$ = [0.3 0 0 0.3 0 0 0.3]; $L_2$ = [0 0.3 0.3 0 0.3 0 0] | 37.33 |
| | [0.1 0.9] | $L_1$ = [0.3 0 0 0.3 0 0 0.3]; $L_2$ = [0 0.3 0.3 0 0.3 0 0] | 39.27 | $L_1$ = [0.3 0 0 0.3 0 0 0.3]; $L_2$ = [0 0.3 0.3 0 0.3 0 0] | 39.27 |
| UndoDancer, $B_c$ = 2 Mb, rates {0, 0.5, 1} Mb | [0.5 0.5] | $L_1$ = [0.5 0 0 1 0.5]; $L_2$ = [0 1 1 0 0] | 29.41 | $L_1$ = [0.5 0 0 1 0.5]; $L_2$ = [0 1 1 0 0] | 29.41 |
| | [0.1 0.9] | $L_1$ = [0.5 0 0 1 0.5]; $L_2$ = [0 1 1 0 0] | 29.93 | $L_1$ = [0.5 0 0 1 0.5]; $L_2$ = [0 1 1 0 0] | 29.93 |

the clients, independently of the probability distribution $p$ of clients requesting each layer. Therefore, optimizing the allocation of the views in the first layer never severely penalizes the overall solution, which further justifies the design of our greedy algorithm. In addition, it has a lower complexity than the optimal algorithm, as demonstrated in Section 4.2. Therefore, for the rest of the paper we only consider the greedy algorithm and we compare it with a baseline solution for view selection and rate allocation.


5.3. Greedy algorithm performance 

After showing the good performance of our greedy algorithm in the previous section, we now study its performance in different scenarios and compare it with a baseline algorithm, namely a distance-based view selection solution [19]. In this algorithm, the views in each layer are selected such that the distance between encoded and synthesized views is minimized. Views are encoded at the same rate in each layer, and the rate per view and the number of views are chosen such that the available bandwidth per layer is used to its maximum. Layers are filled in successive order, as for our greedy algorithm.
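
One possible reading of this baseline, for a single layer, is sketched below. Several points are assumptions made for illustration and are not taken from [19]: the distance criterion is the mean distance from each rendered view to its nearest available reference, ties in bandwidth usage are broken in favour of the higher rate, and the selected views together with those already sent must bracket the navigation window so that every rendered view has a left and a right reference. All function names are hypothetical.

```python
from itertools import combinations


def mean_nearest_distance(targets, references):
    """Average distance from each rendered view to its closest reference view."""
    return sum(min(abs(t - r) for r in references) for t in targets) / len(targets)


def distance_based_layer(candidates, already_sent, targets, rates, budget):
    """Pick views for one layer, all coded at one uniform rate (assumed reading)."""
    # Uniform rate and view count that use as much of the budget as possible;
    # ties are broken in favour of the higher rate (an arbitrary choice).
    options = [(n * r, r, n)
               for r in rates if r > 0
               for n in range(1, len(candidates) + 1)
               if n * r <= budget + 1e-9]
    if not options:
        return [], 0.0
    _, rate, count = max(options)

    def brackets(selection):
        refs = list(already_sent) + list(selection)
        return min(refs) <= min(targets) and max(refs) >= max(targets)

    feasible = [s for s in combinations(sorted(candidates), count) if brackets(s)]
    if not feasible:
        raise ValueError("no selection gives every rendered view a left and a right reference")
    best = min(feasible,
               key=lambda s: mean_nearest_distance(targets, list(already_sent) + list(s)))
    return list(best), rate
```

Calling the function once per layer, and moving the returned views from `candidates` to `already_sent` after each call, reproduces the successive filling of layers described above.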

The algorithms are compared in different settings where the effects of the layer rate constraint and of the view popularity are evaluated. A total of four layers are considered in all the simulations presented in this section, representing four groups of clients that are clustered depending on their bandwidth capabilities. Note that, since we do not consider our optimal algorithm in these simulations, we are able to increase the set of available coding rates for each dataset and the number of layers in the multiview layered representation, compared with the experiments in Section 5.2.


5.3.1. Layer rate constraint variations 

In this subsection, the greedy algorithm is compared with the distance-based solution in terms of the expected distortion when varying the layer rate constraint. We use an illustrative layer rate distribution that follows a linear relationship: $B_c = x \times c + y$, where $c$ is the index of layer $L_c$. By varying the values of $x$ and $y$, we can study the performance of the view selection algorithm in different settings. The corresponding results are presented in Table 2, where the solution from the greedy algorithm outperforms the distance-based solution in terms of the expected distortion $D$ in 4 out of 6 experiments. In the other two cases, the same result is obtained by both algorithms. The performance gain obtained with our greedy algorithm is mainly due to its rate allocation capability compared to the homogeneous rate assignment in the distance-based algorithm. The non-uniform rate allocation of our greedy algorithm permits full use of the available

Table 2: Comparison of the greedy and distance-based algorithms for different layer rate constraints.

| Sequence & settings | Rate constraints $(x, y)$ | Greedy $X^*$ | Greedy $D$ (dB) | Distance-based $X^*$ | Distance-based $D$ (dB) |
|---|---|---|---|---|---|
| Bikes, rates {0, 1, 1.5, 2, 2.5, 2.7} Mb | (2, 2) | $L_1$ = [2 0 0 0 0 0 2]; $L_2$ = [0 1.5 1.5 0 1.5 1.5 0]; $L_3$ = [0 0 0 2.7 0 0 0] | 28.31 | $L_1$ = [2 0 0 0 0 0 2]; $L_2$ = [0 1.5 1.5 0 1.5 1.5 0]; $L_3$ = [0 0 0 2.7 0 0 0] | 28.31 |
| | (0.5, 4) | $L_1$ = [1.5 0 0 1.5 0 0 1.5]; $L_2$ = [0 2 0 0 1.5 1.5 0]; $L_3$ = [0 0 2.7 0 0 0 0] | 28.24 | $L_1$ = [1.5 0 0 1.5 0 0 1.5]; $L_2$ = [0 1.5 0 0 1.5 1.5 0]; $L_3$ = [0 0 0 2.7 0 0 0] | 27.92 |
| Ballet, rates {0, 0.15, 0.18, 0.20, 0.25, 0.3} Mbps | (0.25, 0.25) | $L_1$ = [0.15 0 0 0.2 0 0 0.15]; $L_2$ = [0 0.2 0.25 0 0 0.3 0]; $L_3$ = [0 0 0 0 0.3 0 0] | 37.33 | $L_1$ = [0.25 0 0 0 0 0 0.25]; $L_2$ = [0 0 0.25 0.25 0 0.25 0]; $L_3$ = [0 0.3 0 0 0.3 0 0] | 36.09 |
| | (0.2, 0.1) | $L_1$ = [0.15 0 0 0 0 0 0.15]; $L_2$ = [0 0.25 0 0.25 0 0 0]; $L_3$ = [0 0 0.3 0 0 0.3 0]; $L_4$ = [0 0 0 0 0.3 0 0] | 34.91 | $L_1$ = [0.15 0 0 0 0 0 0.15]; $L_2$ = [0 0 0.25 0 0.25 0 0]; $L_3$ = [0 0.3 0 0 0 0.3 0]; $L_4$ = [0 0 0 0.3 0 0 0] | 34.60 |
| UndoDancer, rates {0, 0.25, 0.5, 0.75, 1, 1.25} Mbps | (0.5, 0.5) | $L_1$ = [0.5 0 0 0 0.5]; $L_2$ = [0 0 0.5 1 0]; $L_3$ = [0 1.25 0 0 0] | 28.64 | $L_1$ = [0.5 0 0 0 0.5]; $L_2$ = [0 0 0.75 0.75 0]; $L_3$ = [0 1.25 0 0 0] | 28.60 |
| | (0.25, 0.75) | $L_1$ = [0.5 0 0 0 0.5]; $L_2$ = [0 0 0 1.25 0]; $L_3$ = [0 0.75 0.75 0 0] | 28.26 | $L_1$ = [0.5 0 0 0 0.5]; $L_2$ = [0 0 0 1.25 0]; $L_3$ = [0 0.75 0.75 0 0] | 28.26 |

rate per layer, allocating more bits to the views used as references in the view synthesis process; e.g., for layer $L_2$ with the UndoDancer sequence when $(x, y) = (0.5, 0.5)$. In these tests, the distance-based view selection solution is relatively close to the optimal solution, and most of the selected views in each layer are almost equally spaced. This is due to the small change in content among different views, which in turn results from the small distance between the cameras and/or the low scene complexity in most of the available datasets. Nevertheless, these experiments show that a simple distance-based solution with a uniform rate allocation among the selected views in each layer is not ideal, as it cannot take into account the actual content of the scene, contrary to our algorithm.
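
The linear layer rate constraint used in this subsection can be checked with a short helper. This is a minimal sketch, assuming that the multiplier in the linear relationship is the layer index $c = 1, 2, \ldots$ and that rates and budgets share the same unit; the function names are hypothetical.

```python
def layer_budgets(x, y, num_layers):
    """Rate constraint B_c = x * c + y for layers c = 1 .. num_layers."""
    return [x * c + y for c in range(1, num_layers + 1)]


def within_budgets(layers, budgets, tol=1e-9):
    """True if the rates allocated in every layer respect that layer's budget."""
    return all(sum(rates) <= budget + tol for rates, budget in zip(layers, budgets))
```

With $(x, y) = (2, 2)$ the budgets of the first three layers are 4, 6 and 8, and the Bikes allocation listed for this setting in Table 2 spends exactly 4 and 6 in its first two layers.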

5.3.2. View popularity distribution variations 

We now compare our greedy algorithm with the distance-based solution when views have different popularities. The results are shown for an exponential popularity distribution, where the leftmost and rightmost views in the set of captured views $\mathcal{V}$ are the most and the least popular views, respectively. The results are presented in Table 3, where the optimal set of views per layer $X^*$ and the total expected distortion $D$ are shown for the greedy and distance-based solutions. The settings for the different sequences are specified in Table 3. The total expected distortion $D$ is calculated assuming a uniform distribution of the proportion of clients accessing each layer, meaning $p = [0.25\ 0.25\ 0.25\ 0.25]$ for the four layers. The results show that the solution from the greedy algorithm outperforms the distance-based solution in terms of the total expected distortion. This is due to the fact that the distance-based solution considers neither the popularity distribution of the views nor an optimized rate allocation among the views. In particular, in the greedy algorithm the views close to the leftmost view (the most popular views) are selected in the first layers, to ensure that most of the clients receive the most popular views and therefore enjoy a higher expected navigation quality. Similar conclusions can be drawn when considering other view popularity distributions.
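
An exponential view popularity of the kind used above can be generated as follows. This is a minimal sketch: the decay parameter and the exact functional form $q(k) \propto e^{-\lambda k}$, with $k$ counted from the leftmost view, are assumptions for illustration, since the text does not state the parameters used in the experiments.

```python
import math


def exponential_popularity(num_views, decay):
    """Normalized popularity q, decreasing from the leftmost to the rightmost view."""
    weights = [math.exp(-decay * k) for k in range(num_views)]
    total = sum(weights)
    return [w / total for w in weights]


def popularity_weighted_distortion(q, view_distortions):
    """Expected distortion over the views of the navigation window."""
    return sum(qu * du for qu, du in zip(q, view_distortions))
```

Setting `decay` to zero recovers the uniform view distribution used in Section 5.2, while larger values concentrate the requests on the leftmost views.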

An alternative presentation of the gain of our greedy algorithm is shown in Fig. 4. A bar plot illustrates the expected quality (Y-PSNR) of our greedy algorithm (GA) and of the distance-based approach (DBA) for the four layers considered in these simulations. We consider the Bikes and UndoDancer datasets, with the same settings as the ones of the results in Table 2. In addition, we have included horizontal lines representing the average quality of each algorithm across the whole client population (the four client clusters), using the same bar color. The distortion is calculated with the views received in the current layer and in all the previous layers, as clients subscribed to a particular layer receive all the views up to that layer. Therefore, for both approaches, the overall quality increases as the layer index increases, since clients are able to receive more views. Note that, in general, our greedy algorithm outperforms the distance-based approach, achieving the highest average quality. In the case of the UndoDancer sequence, however, the group of clients receiving up to layer $L_4$ enjoys a slightly higher quality with the distance-based approach than with the greedy algorithm. This is due to the fact that in layer $L_4$ all the reference views are selected and most of them are encoded at the highest possible rate for the distance-based approach, as it was the only option for the algorithm to fully use the available bandwidth and have a uniform rate allocation among the selected views. However, this view and rate selection of the distance-based solution only favors clients in the last cluster (highest bandwidth capabilities). In fact, the overall performance for the UndoDancer sequence is better for our greedy algorithm, as for the first layers the view selection and rate allocation offer a higher quality to the first three groups of clients.
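
The cumulative reception of layers described above can be made concrete with a short helper. This is a minimal sketch with a hypothetical function name; it only accumulates the per-view rates that each client cluster has available and does not compute any quality value.

```python
def cumulative_rates(layers):
    """Per-view rates available to each client cluster (cluster c gets layers 1..c)."""
    received = [0.0] * len(layers[0])
    per_cluster = []
    for layer in layers:
        # Each view appears in at most one layer, so adding the vectors is enough.
        received = [a + b for a, b in zip(received, layer)]
        per_cluster.append(received)
    return per_cluster
```

Applied to the UndoDancer rate vectors listed in Table 3, the last cluster ends up with `[0.75, 0.75, 1.25, 0.5, 0.5]` under the greedy solution and `[0.5, 1.25, 1.25, 1.25, 0.5]` under the distance-based one, which is consistent with the remark that the distance-based approach codes most views at the highest rate by layer $L_4$.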

Table 3: Greedy and distance-based solutions comparison for an exponential view popularity distribution.

| Sequence & settings | Greedy $X^*$ | Greedy $D$ (dB) | Distance-based $X^*$ | Distance-based $D$ (dB) |
|---|---|---|---|---|
| Statue, $B_c$ = 8 Mb, rates {0, 2, 4, 5, 6, 8} Mb | $L_1$ = [4 0 2 0 0 0 0 2]; $L_2$ = [0 4 0 0 0 4 0 0]; $L_3$ = [0 0 0 4 4 0 0 0]; $L_4$ = [0 0 0 0 0 0 4 0] | 34.56 | $L_1$ = [4 0 0 0 0 0 0 4]; $L_2$ = [0 0 4 0 0 4 0 0]; $L_3$ = [0 0 0 4 4 0 0 0]; $L_4$ = [0 4 0 0 0 0 4 0] | 34.35 |
| Bikes, $B_c$ = 3.5 Mb, rates {0, 1, 1.5, 2, 2.5, 2.7} Mb | $L_1$ = [2 0 0 0 0 0 1.5]; $L_2$ = [0 1.5 2 0 0 0 0]; $L_3$ = [0 0 0 2 1.5 0 0]; $L_4$ = [0 0 0 0 0 2.7 0] | 28.61 | $L_1$ = [1.5 0 0 0 0 0 1.5]; $L_2$ = [0 0 1.5 0 1.5 0 0]; $L_3$ = [0 1.5 0 0 0 1.5 0]; $L_4$ = [0 0 0 2.7 0 0 0] | 26.49 |
| Ballet, $B_c$ = 0.5 Mb, rates {0, 0.15, 0.18, 0.20, 0.25, 0.3} Mbps | $L_1$ = [0.25 0 0 0 0 0 0.25]; $L_2$ = [0 0 0.3 0.2 0 0 0]; $L_3$ = [0 0.3 0 0 0.2 0 0]; $L_4$ = [0 0 0 0 0 0.3 0] | 37.78 | $L_1$ = [0.25 0 0 0 0 0 0.25]; $L_2$ = [0 0 0.25 0 0.25 0 0]; $L_3$ = [0 0.25 0 0 0 0.25 0]; $L_4$ = [0 0 0 0.3 0 0 0] | 37.54 |
| UndoDancer, $B_c$ = 1.25 Mb, rates {0, 0.25, 0.5, 0.75, 1, 1.25} Mbps | $L_1$ = [0.75 0 0 0 0.5]; $L_2$ = [0 0.75 0 0.5 0]; $L_3$ = [0 0 1.25 0 0]; $L_4$ = [0 0 0 0 0] | 30.19 | $L_1$ = [0.5 0 0 0 0.5]; $L_2$ = [0 0 0 1.25 0]; $L_3$ = [0 0 1.25 0 0]; $L_4$ = [0 1.25 0 0 0] | 29.66 |



Figure 4: Expected Y-PSNR (dB) per client cluster receiving $X^*$, for the Bikes (a) and UndoDancer (b) datasets, when comparing our greedy algorithm (GA) and the distance-based algorithm (DBA).

6. Conclusion


We have proposed a novel adaptive transmission solution that jointly selects the optimal subsets of view streams and the rate allocation per view for a hierarchical transmission in IMVS applications. We consider a system where the network is characterized by clients with heterogeneous bandwidth capabilities, and we aim to minimize their expected navigation distortion. To do so, clients are clustered according to their bandwidth capabilities and the different camera views are distributed in layers to be transmitted to the different groups of users in a progressive way, such that the clients with higher capabilities receive more layers (more views), hence benefiting from a better navigation quality. We have formulated an optimization problem to jointly determine the optimal arrangement of views in layers along with the coding rate of the views, such that the expected rendering quality is maximized in the navigation window, while the rate of each layer is constrained by network and client capabilities. To solve this problem, we have proposed an optimal algorithm and a greedy algorithm with a reduced complexity, both based on dynamic programming. It has been shown through simulations that the proposed algorithms are able to reduce the navigation distortion in an IMVS system. In addition, our greedy algorithm has close-to-optimal performance and outperforms a distance-based algorithm based on an equidistant solution with a uniform rate allocation among the selected views in each layer. Our results show that considering the client capabilities, their preferences in navigation, and the 3D scene content, as the proposed optimal and greedy algorithms do, is key in the design of an effective adaptive transmission solution for IMVS systems.